Writing in the Sand: Making it all Work in Digital Libraries

Digital libraries must use URLs (Uniform Resource Locators) to provide access to online materials. As we move into sharing our materials, we must find ways to support Persistent Identifiers, so that the same item can still be accessed from old links even though the software has changed, and so that items can be uniquely identified regardless of current location.

As digital libraries begin to expand to include scientific papers and materials in languages beyond standard English, there is a growing recognition of the need for Unicode support. Current software often requires Unicode for non-ASCII text, and browsers as yet only support this emerging standard with marked limitations.

All the forms of metadata described in these pages are encoded in XML, a basic method of machine encoding of text.

Each of these is briefly introduced here, with links to further resources.


Persistent Identifiers for digital objects can be used to spare web users from "broken links", and to enable management and identification of materials over time.

In 1992, the IETF (Internet Engineering Task Force) created a standard for addressing objects, with an institutional commitment to support them as persistent, location-independent resource identifiers. This standard is the Uniform Resource Name (URN) (namespace specifications here), and the common structure is URN:NID:NSS, where NID is the Namespace IDentifier and NSS is the Namespace Specific String.
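
As a rough illustration of that structure, the sketch below splits a URN into its namespace identifier and the namespace-specific string that follows it. The pattern is deliberately simplified and does not enforce every rule of the URN syntax; the function name `split_urn` and the sample ISBN value are assumptions made for this example.

```python
import re

# Simplified pattern: the literal "urn", a namespace identifier (NID),
# then everything after the second colon as the namespace-specific string.
# The real URN syntax restricts the allowed characters further.
URN_PATTERN = re.compile(
    r"^urn:([A-Za-z0-9][A-Za-z0-9-]{0,31}):(.+)$", re.IGNORECASE
)


def split_urn(urn: str) -> tuple:
    """Return (NID, namespace-specific string) or raise ValueError."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a URN: {urn!r}")
    nid, specific = match.groups()
    # The "urn" prefix and the NID are case-insensitive, so normalize the NID.
    return nid.lower(), specific


print(split_urn("urn:isbn:0451450523"))
```

The part after the second colon is left untouched, because its meaning is defined by each namespace rather than by the URN structure itself.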

In 1995, OCLC (Online Computer Library Center) introduced the Persistent Uniform Resource Locator (PURL), a naming and resolution service that uses URLs (Uniform Resource Locators) but supports the URN technology.

In 2001, ANSI/NISO (the American National Standards Institute in conjunction with the National Information Standards Organization) released the Syntax for the Digital Object Identifier (DOI), which was developed in 2000 by the International DOI Foundation (IDF), a non-profit organization that coordinates applications. Unique identifiers must be registered centrally to ensure global uniqueness and to provide for resolution, since an institution may fail, leaving its resources to another. An example of a registration service for managing the resolution of DOIs is CrossRef.

The Handle System was developed in 2003 by the Corporation for National Research Initiatives (CNRI) to provide a method for redirection to the object; it can also resolve to multiple locations or multiple versions of the object. A handle consists of a numerical prefix identifying the institution, followed by a character-string suffix that identifies the object:
Handle ::= Handle Naming Authority "/" Handle Local Name
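
A minimal sketch of that grammar in Python: the handle is split at the first "/" into the naming authority and the local name. The function name `split_handle` and the sample handle are hypothetical, chosen only for illustration.

```python
def split_handle(handle: str) -> tuple:
    """Split a handle at the first "/" into (naming authority, local name)."""
    authority, separator, local_name = handle.partition("/")
    if not separator or not authority or not local_name:
        raise ValueError(f"not a handle: {handle!r}")
    return authority, local_name


# Hypothetical handle; the local name may itself contain "/" characters,
# which is why only the first "/" is treated as the separator.
print(split_handle("20.500.12345/example-object/v2"))
```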

The International Telecommunication Union put out a recommendation for the generation, registration, and use of Universally Unique Identifiers (UUIDs) in 2004, which has become an international standard.
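
For a sense of what generating UUIDs looks like in practice, here is a short sketch using Python's standard `uuid` module. The item URL is a made-up placeholder, and the choice between random and name-based UUIDs is an assumption of this example, not something the recommendation prescribes for digital libraries.

```python
import uuid

# Hypothetical item address, used only to derive a name-based UUID.
ITEM_URL = "http://example.org/collection/item-42"

# Version 4: generated from random numbers, different on every call.
random_id = uuid.uuid4()

# Version 5: derived from a namespace and a name, so the same input
# always yields the same identifier.
name_id = uuid.uuid5(uuid.NAMESPACE_URL, ITEM_URL)

print(random_id.urn)  # the UUID written in URN form, "urn:uuid:..."
print(name_id == uuid.uuid5(uuid.NAMESPACE_URL, ITEM_URL))  # True
```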

John Kunze of California Digital Library has proposed the Archival Resource Key (ARK) identifier for digital objects. The latest (2006) specifications are here, in an Internet-Draft proposal to the IETF (Internet Engineering Task Force).

In addition, there are efforts to piggyback the globally unique identifier resolution on that of the Domain Name System (DNS) that is used to resolve all URLs. The most recent is the Dynamic Delegation Discovery System (DDDS); the Requests For Comments for the specifications are RFC 3401, RFC 3402, RFC 3403, RFC 3404, and RFC 3405.



Unicode was developed to provide a single encoding for each and every character in all the major languages written today. Before Unicode was developed, there were many different encodings in use, and they often encoded the same thing in different ways, creating chaos. In order for computers to process text, or even display it properly, they must have a set method of rendering or interpreting each character. There are three character encoding forms, corresponding to the number of bits used per code unit: UTF-8, UTF-16, and UTF-32.
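
The following sketch counts how many code units a single character needs in each of the three encoding forms (UTF-8 with 8-bit units, UTF-16 with 16-bit units, UTF-32 with 32-bit units). The helper name `code_units` and the sample characters are assumptions for the example.

```python
def code_units(char: str) -> dict:
    """Count the code units one character needs in each encoding form."""
    return {
        "UTF-8": len(char.encode("utf-8")),            # 1 byte per unit
        "UTF-16": len(char.encode("utf-16-be")) // 2,  # 2 bytes per unit
        "UTF-32": len(char.encode("utf-32-be")) // 4,  # 4 bytes per unit
    }


# Digit three, e with acute accent, euro sign, musical G clef.
for char in ("3", "\u00e9", "\u20ac", "\U0001D11E"):
    print(f"U+{ord(char):04X}", code_units(char))
```

Characters in the ASCII range need one unit in every form, while a character outside the first 65,536 code points, such as the G clef, needs four units in UTF-8, two in UTF-16, and one in UTF-32.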

The latest version is 5.0.0, and the code charts are available here. These charts give the hexadecimal encoding for characters; for example, the Unicode hexadecimal encoding for the number "3" is 0033, which you could write into your HTML as &#x0033;
(Without the "x" it is not hexadecimal: &#0033; is read as decimal 33, and so displays an exclamation point instead!) Actual machine-level encoding will not leave this hexadecimal version viewable to users. Levels of browser support for Unicode vary; here are some instructions on enabling Unicode in your browser. For a further discussion of Unicode and how to use it, see Alan Wood's Unicode Resources.
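
The difference between the hexadecimal and decimal forms of a numeric character reference can be checked with Python's standard `html` module, as in this small sketch. The helper `to_hex_reference` is a hypothetical name introduced here to build the hexadecimal form from a character.

```python
import html


def to_hex_reference(char: str) -> str:
    """Build the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(char):04X};"


hex_ref = html.unescape("&#x0033;")  # hexadecimal 33 is the digit three
dec_ref = html.unescape("&#0033;")   # decimal 33 is the exclamation point

print(hex_ref)                # 3
print(dec_ref)                # !
print(to_hex_reference("3"))  # &#x0033;
```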



XML (Extensible Markup Language) is a simple, flexible method of encoding text for machine interoperability, while retaining human readability and offering ease of creation. The fourth edition of the XML 1.0 specification was released in August 2006. XML was derived from SGML (Standard Generalized Markup Language), of which HTML (HyperText Markup Language) is an application. A discussion of their relationship is available here. An XML tutorial is here.

XML documents use a simple syntax. The first line is normally the XML declaration (optional in XML 1.0, but recommended), which defines the version and the character encoding, such as:
<?xml version="1.0" encoding="ISO-8859-1"?>

After this comes the root element (named <root> in the example that follows), which contains all other elements. All of the enclosed elements can have sub-elements, as long as they are correctly nested within their parent elements.

    <root>
        <parent>
            <child>
                <subchild> . . . </subchild>
            </child>
        </parent>
    </root>

Thus, XML is well suited to use in organizing hierarchical objects such as books:

    <book>
        <information>
            <title>A Book on XML</title>
            <author>Someone Besides Me</author>
        </information>
        <contents>
            <chapter number="1">What's XML all about?</chapter>
            <chapter number="2">Why do we need another standard?</chapter>
            <chapter number="3">Simple Syntax</chapter>
        </contents>
    </book>

Obviously, a real book would contain much more information, and several more chapters, but this serves to illustrate the nesting of elements and the self-descriptive nature of those elements. The "/" inside the element tag indicates the end of the element whose name follows; so the element <chapter number="1"> begins with that tag and ends with the </chapter> tag that follows. Any elements that begin between those two tags must also end between those two tags. That is what is meant by "nesting". All attribute values must be quoted (with either single or double quotes), and all tags are case-sensitive. An online syntax-checking service is available.

Using a DTD (Document Type Definition) or an XML Schema, you can specify the legal elements and attributes, as well as their usage, within your document. Validating against the DTD or schema ensures that the document meets the specified requirements. Using the same DTD or schema across institutions supports machine interoperability, as all the documents will use the elements and attributes in the same way.
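
The nesting and case-sensitivity rules can also be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` parser on the book example; the helper name `is_well_formed` is an assumption of this example. Note that it tests only well-formedness: this parser does not validate against a DTD or schema, which needs a separate validating tool.

```python
import xml.etree.ElementTree as ET

BOOK = """<book>
  <information>
    <title>A Book on XML</title>
    <author>Someone Besides Me</author>
  </information>
  <contents>
    <chapter number="1">What's XML all about?</chapter>
    <chapter number="2">Why do we need another standard?</chapter>
    <chapter number="3">Simple Syntax</chapter>
  </contents>
</book>"""


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Walk the hierarchy and list each chapter with its number attribute.
root = ET.fromstring(BOOK)
for chapter in root.iter("chapter"):
    print(chapter.get("number"), chapter.text)

print(is_well_formed(BOOK))                                # True
print(is_well_formed("<book><title>Oops</book></title>"))  # False: bad nesting
print(is_well_formed("<book><Title>Oops</title></book>"))  # False: case mismatch
```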

The Organization for the Advancement of Structured Information Standards (OASIS), a non-profit international organization that produces many web services standards, offers an informative website on the uses and applications of XML, as well as an "Online resource for markup language technologies" in general.

This information is provided without guarantees as to validity or completeness, particularly in light of the fact that the world of metadata in digital libraries is a world of shifting sands, constantly changing.