
Copyright © 2002 Wallace Koehler - All Rights Reserved

Text Mark Up Languages

 

 

Page Index
Examples
    Head
    Body
WWW Mark Up Records
    IAFA Templates
    ALIWEB
    Harvest/SOIF
    ONIX
    Digital Geospatial Metadata
Archival Mark Up
    Corpus Encoding Standard
    Model Editions Partnership
    Encoded Archival Description  DTD
    Text Encoding Initiative
 

Mark up languages vary greatly in complexity and use. They all have one thing in common: they are meant to be used in the electronic digital environment. As we know, that is a large environment that extends well beyond the Web. We explore some of the non-Web mark up languages because they could be migrated to Web applications, or they may give rise or inspiration to those applications.

An excellent descriptive resource for mark up languages can be found at http://ukoln.ac.uk/metadata/desire/overview/rev_toc.htm

Examples

Walk through the examples given here before you try to tackle the various mark up languages. They are designed to go from fairly simple to complex, to demonstrate that (a) the concepts underlying mark up languages are fairly simple but (b) they get very confusing real fast. The simpler the language, I would suggest, the more likely Web creators would be willing to employ it themselves. In the absence of simplicity, an easy template might do. To ask anyone who encodes only a few documents, and does so rarely, to use a complex set of tags invites disaster or refusal.

In the Head

This is by far the most common approach to electronic mark up for all kinds of documents, including Web documents. The standard and Dublin Core metatags are examples. The header of any html document contains at minimum how and when the document was created and the mark up language used (html as a rule). It may also contain authority, rights, and other information. This is the information placed between the <head></head> tags, hence "headers." Typically, we do not see the header information, but the computer does. We can ask the browser to show us the header information through a view page source command.

At a minimum, for this page it looks like:
 
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.7 [en] (Win95; U) [Netscape]">
   <meta name="Author" content="wallace koehler">
   <title>Mark Up Made Easy</title>
<!--This file created 12:01 PM  2/20/00 by Claris Home Page version 3.0-->
<X-CLARIS-WINDOW TOP=0 BOTTOM=435 LEFT=0 RIGHT=788>
</head>

As we have already seen, a wide range of metatags can be added to provide far more information to catalogers, search engine spiders, and others seeking to define the information content of the page.
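As a minimal sketch of how a cataloging program or spider might read this header information, the following Python uses only the standard library's html.parser module. The class name HeadReader and the sample string are illustrative assumptions, not part of any tool discussed on this page.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # e.g. {"author": "wallace koehler"}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)     # tag and attribute names arrive lowercased
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Sample header modeled on the one shown above (shortened).
SAMPLE = """<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="Author" content="wallace koehler">
   <title>Mark Up Made Easy</title>
</head>"""

reader = HeadReader()
reader.feed(SAMPLE)
print(reader.title)
print(reader.meta)
```

The http-equiv line is skipped on purpose because it carries no name attribute; a fuller reader would probably keep it under a separate key.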

As promised, here is a more complex set of headers. These are the Dublin Core headers for the Dublin Core page in this document, repeated from that page. Note first that everything lies between the <head> and </head> tags. Each Dublin Core metatag consists of three parts: (1) the "meta name," (2) a schema statement, and (3) a reference. Each metatag defines document data of some kind (title or creator name, for example). The schema statement identifies the scheme used. The reference provides the reader with the justification and definition for the tag.

Other header systems may or may not provide this degree of documentation within the header. Dublin Core is one of the simpler of the formal mark up systems, as you will see as you explore others listed below.
 
<head> 
               <title> Web Management: Dublin Core </title> 
               <META NAME="DC.Title" CONTENT="Web Document Management"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#title"> 

               <META NAME="DC.Title.Alternative" CONTENT="LIS 5990 Summer 2000"> 
               <LINK
               REL=SCHEMA.dc HREF="http://purl.org/metadata/dublin_core_elements#title"> 

               <META NAME="DC.Creator.PersonalName" CONTENT="Koehler, Wallace"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#creator"> 

               <META NAME="DC.Creator.PersonalName.Address"
               CONTENT="wkoehler@ou.edu"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#creator"> 

               <META NAME="DC.Subject" CONTENT="metadata"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="bibliographic control"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="Web sites"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="WWW"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Dublin Core,
               metadata, resource discovery"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Internet"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Library Catalogs
               and Bulletins"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Indexing,
               Abstracting"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Classification and
               Notation"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="(SCHEME=LCCS) ZA"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Subject" CONTENT="(SCHEME=LCCS) Z"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#subject"> 

               <META NAME="DC.Description" CONTENT="A library school web-based
               graduate course on bibliographic management of the WWW"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#description"> 

               <META NAME="DC.Publisher" CONTENT="School of Library and Information
               Studies, University of Oklahoma"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#publisher"> 

               <META NAME="DC.Date" CONTENT="(SCHEME=ISO8601) 2000-05-01"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#date"> 

               <META NAME="DC.Type" CONTENT="Text.Index"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#type"> 

               <META NAME="DC.Format" CONTENT="(SCHEME=IMT) text/html"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#format"> 
               <LINK REL=SCHEMA.imt HREF="http://sunsite.auc.dk/RFC/rfc/rfc2046.html"> 

               <META NAME="DC.Identifier" CONTENT="http://www.ou.edu/"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#identifier"> 

               <META NAME="DC.Language" CONTENT="(SCHEME=ISO639-1) en"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#language"> 

               <META NAME="DC.Coverage" CONTENT="metadata"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#coverage"> 

               <META NAME="DC.Rights" CONTENT="Copyright Wallace Koehler 2000 All
               Rights Reserved"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#rights"> 

               <META NAME="DC.Date.X-MetadataLastModified"
               CONTENT="(SCHEME=ISO8601) 2000-02-05"> 
               <LINK REL=SCHEMA.dc
               HREF="http://purl.org/metadata/dublin_core_elements#date"> 
 </head>
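
To see how a cataloging program might use headers like these, here is a rough Python sketch that groups the DC metatags by element name and splits off the "(SCHEME=...)" qualifier where one is present. The names DublinCoreReader and SCHEME_RE are hypothetical, the sample is a shortened copy of the header above, and the sketch ignores the LINK lines entirely.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Matches an optional leading "(SCHEME=XXX) " qualifier in a CONTENT value.
SCHEME_RE = re.compile(r"^\(SCHEME=([^)]+)\)\s*(.*)$")


class DublinCoreReader(HTMLParser):
    """Group DC.* metatags by element name, keeping repeated values."""

    def __init__(self):
        super().__init__()
        self.record = defaultdict(list)   # element -> [(scheme, value), ...]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        if tag != "meta" or not name.startswith("DC."):
            return
        # Collapse runs of whitespace so wrapped values read as one line.
        content = " ".join((attrs.get("content") or "").split())
        match = SCHEME_RE.match(content)
        scheme, value = match.groups() if match else (None, content)
        self.record[name].append((scheme, value))


SAMPLE = """<head>
<META NAME="DC.Title" CONTENT="Web Document Management">
<META NAME="DC.Subject" CONTENT="metadata">
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Internet">
<META NAME="DC.Date" CONTENT="(SCHEME=ISO8601) 2000-05-01">
</head>"""

reader = DublinCoreReader()
reader.feed(SAMPLE)
for element, values in reader.record.items():
    print(element, values)
```

Keeping a list per element matters here because DC.Subject is repeated many times in the header, once per subject term.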

In the Body

In the body, formal mark up is far rarer than in the head. Keywords and other data are marked with metadata tags in order to populate a record or template.

Let us take a block of text  from within this Web site. It has the following appearance in the browser or when printed out.
 

The course is designed to familiarize graduate students and working librarians with the various approaches to and problems
with bibliographic management of the Web strategies either at the author end or at the cataloger end. 

The first question necessarily must be "can we do it", followed quickly by "should we do it." If we can do and if we should do it, what is it we should do? These define the purpose of this course. 

 Purpose, Expectations, and
 Requirements

Yet embedded within this text are numerous html codes, which look like this:
 
 
<a href="#contact"><img SRC="button_gif.gif" BORDER=0 X-CLARIS-USEIMAGEWIDTH X-CLARIS-USEIMAGEHEIGHT height=40 width=42 align=TEXTTOP></a>
<p>The course is designed to familiarize graduate students and working
librarians with the various approaches to and problems with bibliographic
management of the Web strategies either at the author end or at the cataloger
end.
<p>The first question necessarily must be "can we do it", followed quickly
by "should we do it." If we can do and if we should do it, what is it we
should do? These define the purpose of this course.
<br>&nbsp;
<br>&nbsp;
<br>&nbsp;
<table BORDER=0 WIDTH="99%" BGCOLOR="#FFFFFF" >
<tr>
<td WIDTH="42" HEIGHT="136" BGCOLOR="#FFFFFF"></td>

<td WIDTH="99%" HEIGHT="136">
<h1>
<a href="pages/purpose.htm">Purpose</a>, <a href="pages/tools.html#expectations">Expectations</a>,
and <a href="pages/Require.htm">Requirements</a></h1>
 

From the point of view of the cataloger or indexer, none of this is particularly useful in defining what the paragraph is "about" or how to manage its information content. It tells us only that there are control codes within the text, the size and location of graphics, and the addresses of linked objects. The various SGML DTD based mark up languages add further codes, different codes for different mark up systems, to specify metadata categories in the document.

Let's look again at the html text, this time with some "mark up." The document is color coded: green for title, red for author, lavender for corporate author and contact, and gray for abstract. This scheme could work because the source code is not itself colorful; it carries a color code. For example, the title is rendered <font color="#009900">Web Document Management Course</font>. If our system were defined to populate a catalog's title field each time it encountered <font color="#009900">document title</font>, we would have a start at a cataloging mark up language. We would of course be precluded from using these four colors for other purposes. Moreover, our mark up lexicon would be as limited as our palette.
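
A minimal Python sketch of such a system follows. It assumes a lookup table from color code to catalog field; only the green title code (#009900) comes from the text, so the table COLOR_FIELDS and the class ColorCodeReader are hypothetical and would need one entry per color in a real scheme.

```python
from html.parser import HTMLParser

# Hypothetical color-to-field table. Only the green title code is given
# in the text; the other three colors would be added the same way.
COLOR_FIELDS = {"#009900": "title"}


class ColorCodeReader(HTMLParser):
    """Copy text inside <font color=...> into the matching catalog field."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._field = None    # catalog field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            color = (dict(attrs).get("color") or "").lower()
            self._field = COLOR_FIELDS.get(color)

    def handle_endtag(self, tag):
        if tag == "font":
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.record.setdefault(self._field, []).append(data.strip())


reader = ColorCodeReader()
reader.feed('<p>Welcome to the <font color="#009900">Web Document '
            'Management Course</font>.</p>')
print(reader.record)
```

Text outside a recognized color is simply ignored, which is the limitation noted above: the vocabulary can never be larger than the palette.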
 
 
<p>Welcome to the Web Document Management Course. It is designed to be
self-contained, sometimes synchronous, more often asynchronous Web based
course offered by <a href="http://www.ou.edu/cas/slis/ethics/CV/Index.htm">Dr.
Wallace Koehler</a>,
<a href="http://www.ou.edu/cas/slis">MLIS</a>, at the <a href="http://www.ou.edu">Valdosta State University</a>.<a href="#contact"><img SRC="button_gif.gif" BORDER=0 X-CLARIS-USEIMAGEWIDTH X-CLARIS-USEIMAGEHEIGHT height=40 width=42 align=TEXTTOP></a>
<p>The course is designed to familiarize graduate students and working
librarians with the various approaches to and problems with bibliographic
management of the Web strategies either at the author end or at the cataloger
end.
<p>The first question necessarily must be "can we do it", followed quickly
by "should we do it." If we can do and if we should do it, what is it we
should do? These define the purpose of this course.
<br>&nbsp;
<br>&nbsp;
<br>&nbsp;
<table BORDER=0 WIDTH="99%" BGCOLOR="#FFFFFF" >
<tr>
<td WIDTH="42" HEIGHT="136" BGCOLOR="#FFFFFF"></td>

<td WIDTH="99%" HEIGHT="136">
<h1>
<a href="pages/purpose.htm">Purpose</a>, <a href="pages/tools.html#expectations">Expectations</a>,
and <a href="pages/Require.htm">Requirements</a></h1>

The value and ease of this kind of mark up language is that we can pluck metadata elements from the text of the document. Moreover, we can place terms and concepts in context if our mark up language is sufficiently granular or rich. Consider if we were limited to mark up for subjects and had but one code for it, say <s>subject</s>. Consider the sentence: "Henry Ford built Ford cars." To mark it up "Henry <s>Ford</s> built <s>Ford</s> cars" would be ambiguous at best.

What if we could modify those terms and mark them up as "personalname" and "productname"? It might look like this: "Henry <spern>Ford</spern> built <spron>Ford</spron> cars."

As yet we have not marked up the object "car," the verb "built," nor Ford's first name. We could. "Built" is the past tense of the transitive verb "to build." Maybe we could mark it <ptt>built</ptt>. The word "car" could be defined as a non-specific noun (<nsn> ?) and so on.

In theory we could mark up every word in a document. But consider: if we had a document with its first sentence marked up "<sperfn>Henry</sperfn> <sperln>Ford</sperln> <ptt>built</ptt> <spron>Ford</spron> <nsn>cars</nsn>", we would be able to populate a catalog record with a great deal of information, and be able to specify and retrieve records describing actions in the past that involve a specified historical person manufacturing a specified product (cars, not trucks carrying the Ford logo).

As you can see, this gets pretty complicated in a hurry. But that's not the end. Let's address the word "car" and its classification. Let's redefine it. "Car" belongs to the phylum "vehicle," genus "land," species "motorized," race "four-wheel." We might classify it nvlmf. Thus, we might mark the sentence up as "<sperfn>Henry</sperfn> <sperln>Ford</sperln> <ptt>built</ptt> <spron>Ford</spron> <nvlmf>cars</nvlmf>". If our document were only about Henry Ford building cars, we might not need to repeat the code nor add others. Why go to the trouble? Surely the character string "car" can be searched. But that is more ambiguous than "nvlmf AND car." And if your return set is numbered in the tens of thousands, as it often is on the Web, anything that narrows toward relevance should be considered.
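
The Henry Ford example can be sketched in a few lines of Python. This is only an illustration: the tags are the invented ones used above, and the function names index_sentence and matches are hypothetical. It shows why a "tag AND term" search separates Ford the person from Ford the product where a plain string search cannot.

```python
import re

# Matches <tag>text</tag> where the closing tag repeats the opening tag.
ELEMENT_RE = re.compile(r"<(\w+)>(.*?)</\1>")


def index_sentence(marked_up):
    """Return a list of (tag, term) pairs found in a marked up sentence."""
    return [(tag, term.lower()) for tag, term in ELEMENT_RE.findall(marked_up)]


def matches(index, tag, term):
    """True if the term occurs under the given tag (a 'tag AND term' search)."""
    return (tag, term.lower()) in index


SENTENCE = ("<sperfn>Henry</sperfn> <sperln>Ford</sperln> <ptt>built</ptt> "
            "<spron>Ford</spron> <nvlmf>cars</nvlmf>")

index = index_sentence(SENTENCE)
print(matches(index, "sperln", "Ford"))   # Ford as a person's last name
print(matches(index, "spron", "Ford"))    # Ford as a product name
print(matches(index, "nvlmf", "cars"))    # cars as a classified vehicle
print(matches(index, "nvlmf", "Ford"))    # Ford is not tagged as a vehicle
```

The sketch matches whole terms only, so "nvlmf AND car" would need stemming or a prefix match to find "cars."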

Would Web authors be willing to adopt such a mark up language? I think the answer is yes and no. In the "early years" of word processing, we had to remember all sorts of key strokes to create documents. Word processing is now about as complex as typing. Using an html text editor is also much more straightforward than it was not too long ago. It is possible to build a text editor that can embed the appropriate code using drop down menus or other aids. Will we do it?
 

WWW Mark Up Records

IAFA Templates

In conjunction with the Internet Engineering Task Force (IETF) and others, David Beckett (1995?) sought to develop a metadata standard for Web documents. It uses elements, element clusters, and handles to define document data. The IAFA template does not appear to have been implemented.
 

ALIWEB

ALIWEB is a Web search engine described at http://aliweb.emnet.co.uk/. It describes itself as "the oldest and cleanest search engine on the Web!" ALIWEB is an IAFA template based Web document surrogate mark up system. Web creators and users create the template mark up, then register the template with ALIWEB for inclusion in the service. An annotated copy of the template is found at http://aliweb.emnet.co.uk/siteidx.html. Once a template is registered, ALIWEB periodically scans the index file maintained by the Web document creator and thereby updates its index automatically. The ALIWEB index is therefore as current as the updates Web creators are willing to make.

ALIWEB does not scan the native page. Its weakness is that it depends upon Web page creators to maintain the integrity of the bibliographic record. But that is also one of its strengths, for the record is author created.

The template consists of five elements or tags. These are:
 
 
 

     "SITEINFO for a record containing information about the server
      ORGANIZATION for a record containing information about the organisation
      DOCUMENT for a record containing information on documents (pages) on the server
      SERVICE for a record containing information on services available on the server
      USER for a record containing information on users at the site"

source: http://aliweb.emnet.co.uk/siteidx.html

An example of a fully marked record can be found at http://www.nexor.com/site.idx
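
As a rough sketch of how a service might read such an index file, the Python below assumes an IAFA-style layout: plain "Field: value" lines, records separated by a blank line, and a Template-Type field naming one of the five record types. That layout, the function name parse_index, and every field and value in the sample are assumptions made for illustration; consult the annotated template for the authoritative field list.

```python
RECORD_TYPES = {"SITEINFO", "ORGANIZATION", "DOCUMENT", "SERVICE", "USER"}


def parse_index(text):
    """Split an index file into records, each a dict of field -> value."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            if ":" not in line:
                continue                  # skip anything that is not "Field: value"
            field, value = line.split(":", 1)
            record[field.strip()] = value.strip()
        # Keep only records whose type is one of the five listed above.
        if record.get("Template-Type") in RECORD_TYPES:
            records.append(record)
    return records


# Invented sample data, for illustration only.
SAMPLE = """Template-Type: DOCUMENT
Title: Text Mark Up Languages
URI: /pages/markup.html
Description: Course notes on mark up languages
Keywords: metadata, mark up

Template-Type: USER
Name: A. Cataloger
"""

records = parse_index(SAMPLE)
for record in records:
    print(record["Template-Type"], sorted(record))
```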

Harvest Summary Object Interchange Format (SOIF)

SOIF uses a rather arcane vocabulary to describe its chief "players." Gatherers, usually robots, create records of individual objects in SOIF. They in turn provide these records to Brokers. Brokers, in turn, provide collection and indexing services. Note that unlike ALIWEB, SOIF services are not creator based. Note also the greater degree of complexity in the element set.
 

The Harvest SOIF contains twenty elements:
 
 
 

Abstract                 Brief abstract about the object.
Author                   Author(s) of the object.
Description              Brief description about the object.
File-Size                Number of bytes in the object.
Full-Text                Entire contents of the object.
Gatherer-Host            Host on which the Gatherer ran to extract information from the object.
Gatherer-Name            Name of the Gatherer that extracted information from the object (e.g. Full-Text, Selected-Text, or Terse).
Gatherer-Port            Port number on the Gatherer-Host that serves the Gatherer's information.
Gatherer-Version         Version number of the Gatherer.
Keywords                 Searchable keywords extracted from the object.
Last-Modification-Time   The time that the object was last modified (in seconds since the epoch).
MD5                      MD5 16-byte checksum of the object.
Partial-Text             Only the selected contents from the object.
Refresh-Rate             How often the Broker attempts to update the content summary (in seconds relative to Update-Time).
Time-to-Live             How long the content summary is valid (in seconds relative to Update-Time).
Title                    Title of the object.
Type                     Example: Archive, Audio, Awk, Backup, Binary, C, CHeader, Command, Compressed, CompressedTar, Configuration, Data, Directory, DotFile, Dvi, FAQ, FYI, Font, FormattedText, GDBM, GNUCompressed, GNUCompressedTar, HTML, Image, Internet-Draft, MacCompressed, Mail, Makefile, ManPage, Object, OtherCode, PCCompressed, Patch, Perl, PostScript, RCS, README, RFC, SCCS, ShellArchive, Tar, Tcl, Tex, Text, Troff, Uuencoded, and WaisSource.
Update-Time              The time that the Gatherer updated (generated) the content summary from the object (in seconds since the epoch).
URL                      URL of the object.
URL-References           Any URL references present within HTML objects.

                              Source: http://xtal1.sdsc.edu/Harvest/brokers/Attributes.html
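
Several of these elements are times expressed in seconds. As a small worked example, the Python below shows how Update-Time, Time-to-Live, and Refresh-Rate might be combined. The record is a hypothetical dictionary with invented values, not output from a real Gatherer, and the function names are illustrative only.

```python
from datetime import datetime, timezone

# Hypothetical content summary using attribute names from the list above.
# The numeric values are invented for illustration.
summary = {
    "Update-Time": 959817600,    # seconds since the epoch
    "Time-to-Live": 2592000,     # 30 days, relative to Update-Time
    "Refresh-Rate": 604800,      # 7 days, relative to Update-Time
}


def as_utc(seconds):
    """Convert seconds since the epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def expires_at(record):
    """Moment after which the content summary is no longer valid."""
    return as_utc(record["Update-Time"] + record["Time-to-Live"])


def refresh_due(record, now):
    """True once the Broker should try to update the content summary."""
    return now >= as_utc(record["Update-Time"] + record["Refresh-Rate"])


print(as_utc(summary["Update-Time"]).isoformat())
print(expires_at(summary).isoformat())
print(refresh_due(summary, as_utc(summary["Update-Time"] + 700000)))
```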

     Harvest/SOIF appears to be no longer supported at the University of Colorado, Boulder.


David Beckett, IAFA Templates in Use as Internet Metadata. Available: http://www.w3j.com/1/a.052/paper/052.html

Darren R. Hardy, Michael F. Schwartz,  Duane Wessels, Harvest User's Manual, v 1.2 dated April 1995. Available: http://www.epi.mh-hannover.de/~bueker/harvest/user-manual/user-manual.html


PUBLISHER's MARKUP

ONIX International


ONIX International (Online Information Exchange) is a metadata standard developed for and by publishers and online book sellers. It is an Internet compliant standard. The standard was first released in 2000. It can be used to describe both traditional and e-publications. It is an XML based markup language supported by its own document type definition.

There are a number of specialized RDF applications under development. For example, the Information and Content Exchange (ICE) is a mark up language designed as a B2B (business to business) publishing standard "for use by content syndicators and their subscribers." (http://www.w3.org/TR/NOTE-ice).

The RSS (RDF/Rich Site Summary) is an RDF-based metadata system for news transfer and mark up.

PRISM (Publishing Requirements for Industry Standard Metadata) is an RDF-based software publishing industry mark up standard.

NewsML, developed by the International Press Telecommunications Council, is similar to RSS and facilitates news data exchanges. It is not, however, an RDF- based system.

NITF was developed as a complement to NewsML. NITF is an XML-based mark up for the transfer of textual news stories. For additional information on these applications, see http://www.xmlnews.org/ and http://www.ilrt.bris.ac.uk/discovery/rdf-dev/roads/subject-listing/rdfapps.html.


Digital Geospatial Metadata

Development of Digital Geospatial Metadata resulted in part from Presidential Executive Order 12906 (1994), "Coordinating Geographic Data Acquisition and Access: The National Spatial Data Infrastructure." The Executive Order called upon US government agencies to develop a standardized naming scheme for geospatial data. Federal geospatial metadata development is managed by an interagency committee, the Federal Geographic Data Committee. The metadata set provides for identification, use, format, spatial referents, temporal, and other data. For a list of elements see http://www.fgdc.gov/metadata/fgdc-std-001-1998.dtd. For further information, see http://www.fgdc.gov/metadata/contstan.html.

A crosswalk from FGDC Digital Geospatial Metadata to USMARC can be found at http://www.alexandria.ucsb.edu/public-documents/metadata/fgdc2marc.html.


Archival Mark Up

Most archival mark up languages are XML based (XML is a subset of SGML) and follow the header standard; that is, the mark up or description is placed in the <head></head> portion of the html. Most follow the standards laid down by TEI. Note that MEP supports both header and document markup. What follows here is a rudimentary overview of each of the archival mark up systems. For a more in-depth review, follow the links provided.

As has already been pointed out, the archival and historical community have been active in document mark up. The following list is not exhaustive but these are important applications.

The Corpus Encoding Standard

CES provides encoding to interpret and represent text. It employs a markup metalanguage (http://www.cs.vassar.edu/CES/CES1-1.html#ToCDef), a markup scheme consisting of a "triple." A triple contains a character set, syntax, and semantics.

CES follows TEI standards and is described as a TEI subset, or "simplified" TEI.

The CES Header acts as an "electronic title page" and consists of an extensive element tree. For an example of the CES markup for George Orwell's 1984, go to the bottom of the Web page at http://www.cs.vassar.edu/CES/CES1-3.html
 

Model Editions Partnership

MEP is described in a 1995 article by David Chesnutt in D-Lib Magazine entitled "The Model Editions Partnership: Historical Editions in the Digital Age," available at: http://www.dlib.org/dlib/november95/11chesnutt.html. It is a consortium of seven projects to mark up historical documents. MEP is designed to reflect current historical usage through its markup.

Markup guidelines are found at http://MEP.cla.sc.edu/MepGuide.html. Like other archival markup languages, MEP conforms to TEI standards.

MEP consists of both a header definition and textual markup within the body of the document. MEP, like the others, places bibliographic definition between its <mepHeader></mepHeader> tags, but also marks up the electronic surrogate of the document itself between <doc></doc> tags.

Encoded Archival Description  Document Type Definition

Created in 1994 and supported by the Society of American Archivists, EAD represents an extensive markup metalanguage. For a list of terms, see http://lcweb.loc.gov/ead/tglib/tlelem.html. EAD is considered a "finding aid" rather than a definitive electronic description of the native document.

EAD consists of three main elements: header <eadheader></eadheader>, front matter <frontmatter>, and archival description <archdesc>. The EAD header consists of tags that define elements and attributes and follow TEI standards. Front matter is seen as a bibliographic descriptor, but it is less formal in structure than the header and need not follow TEI format. Archival description contains information about the document body (see http://lcweb.loc.gov/ead/tglib/tlover.html).
 

Text Encoding Initiative

TEI is a system to document the bibliographic content of documents. This information is provided within the TEI header (see http://etext.lib.virginia.edu/bin/tei-tocs?div=DIV1&id=HD). Depending on the document and the degree of bibliographic control sought, the TEI header can be a fairly simple or a very complex statement, as shown in the following table:
 
 
 
<teiHeader>
    <fileDesc>
         <titleStmt>Title</titleStmt>
         <editionStmt>Edition</editionStmt>
         <extent>Extent</extent>
         <publicationStmt>Pub</publicationStmt>
         <seriesStmt> Series</seriesStmt>
         <notesStmt>Notes</notesStmt>
         <sourceDesc>Source </sourceDesc>
    </fileDesc>
    <!-- more header here -->
</teiHeader>

 

The TEI header contains a series of elements. The first is always the file description <fileDesc>. It is designed to provide standard bibliographic referents as shown above.

The second major element is the Encoding Description <encodingDesc>. It is an optional description of the methods and editorial principles underlying the document description. There are nine optional subdivisions to the <encodingDesc>.

The third major element is the Profile Description <profileDesc>. This too is an optional element, and it includes information of a non-bibliographic nature surrounding a document. It includes various subelements that describe how a document came to be <creation>, the languages and sub-languages found in the text <langUsage>, and the nature of the text <textClass>. There are additional subelements.

The fourth and final element is the Revision Description <revisionDesc>. This last element pertains not to the described document, but rather to revisions to the description. It is, if you will, a provenance of the bibliographic document description. It too is optional.
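
The four-part structure can be checked mechanically. The Python sketch below uses the standard library's xml.etree.ElementTree to report which major elements a header contains and to list the subelements of <fileDesc>. The sample header is a simplified, well-formed placeholder in the spirit of the table above, not a valid TEI document, and the function name summarize is hypothetical.

```python
import xml.etree.ElementTree as ET

# A small, well-formed header for illustration; the contents are placeholders.
HEADER = """<teiHeader>
    <fileDesc>
        <titleStmt>Title</titleStmt>
        <publicationStmt>Pub</publicationStmt>
        <sourceDesc>Source</sourceDesc>
    </fileDesc>
    <revisionDesc>Revised description</revisionDesc>
</teiHeader>"""

MAJOR_ELEMENTS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]


def summarize(header_xml):
    """Report which major elements are present and list fileDesc's children."""
    root = ET.fromstring(header_xml)
    present = {name: root.find(name) is not None for name in MAJOR_ELEMENTS}
    file_desc = root.find("fileDesc")
    if file_desc is None:
        # The file description is the one element that is always expected.
        raise ValueError("header has no <fileDesc>")
    fields = {child.tag: (child.text or "").strip() for child in file_desc}
    return present, fields


present, fields = summarize(HEADER)
print(present)
print(fields)
```

Only <fileDesc> is treated as required here, following the description above; the other three elements are reported as present or absent without complaint.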



For an extensive list of specific library projects, see Michael's Website, Libraries and Electronic Archives available at: http://www.wam.umd.edu/~mlhall/library.html