Writing in the Sand: Text Encoding Initiative (TEI) in Digital Libraries

TEI (Text Encoding Initiative) is currently the best metadata choice for transcribing text for analysis and reading. Text documents can be very complex in structure, and TEI accommodates their remarkable variability while also allowing very detailed markup. Since most older documents of interest are handwritten, transcription is often necessary simply to make them readable via the web (or even in person, for the untrained eye). These documents are not suitable for OCR (Optical Character Recognition) processing, which can extract text from images of typewritten pages.

In these older documents, everything is of interest to one scholar or another. Some want to study the variations in how letters and documents of the time are written and addressed; some are interested in the abbreviations, the type of wording chosen, the misspellings, the handwriting, the additions and deletions, crossed-out text, images added, gaps and damage, or simply the paper or skins used, and the writing implements and medium. Others are interested in the message content for cultural, historical, linguistic, social, or other research. TEI allows for multiple levels of detail in both tagging and hierarchy; a TEI document can therefore range from extremely simple to remarkably complex. The five encoding levels are as follows:

  1. Level 1: Fully Automated Conversion and Encoding
  2. Level 2: Minimal Encoding
  3. Level 3: Simple Analysis
  4. Level 4: Basic Content Analysis
  5. Level 5: Scholarly Encoding Projects

The University of Tennessee currently has only one collection online that is as complex as level 4 tagging, and it was developed in TEI Lite, which is a subset of the possible tags available in TEI. Here's a sample (level 4) TEI Lite document. Just inside the TEI.2 tag, which identifies the file as "sl072", the first section is the TEI header (teiHeader).

Markup enables the delivery software to provide extra information about the tagged words or phrases, and it also enables the delivery software to index the tagged information. For example, if every date in the text is tagged with <date> tags, then those values can be indexed separately, so a user can search only within the date values. This same file is viewable online in display software. You can see the marked up sections in green font, correcting misspellings (so a user can search on the correctly spelled word), noting additions, gaps, expansions, deletions, and more. By clicking on "view page image" you can see the letter that was transcribed, and page through it there as well. The DTD we used (a DLXS version of TEI Lite) is here.
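The indexing idea described above can be sketched in a few lines of Python. This is only an illustration, not the DLXS implementation: the sample fragment, its id, and the letter text are invented, the function names are hypothetical, and the markup assumes the TEI P4 conventions of a `value` attribute on `<date>` and a `corr` attribute on `<sic>`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical TEI Lite (P4-style) fragment; the id and letter text are invented.
SAMPLE = """<TEI.2 id="example01">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
      <publicationStmt><p>Illustration only</p></publicationStmt>
      <sourceDesc><p>Invented text</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Wrote on <date value="1862-03-04">March 4th</date> that we
      <sic corr="received">recieved</sic> your letter of
      <date value="1862-02-20">the 20th</date>.</p>
    </body>
  </text>
</TEI.2>"""


def index_dates(root):
    """Map each normalized date value to the date strings as written."""
    index = defaultdict(list)
    for el in root.iter("date"):
        written = "".join(el.itertext()).strip()
        # Fall back to the written form when no normalized value is given.
        key = el.get("value") or written
        index[key].append(written)
    return dict(index)


def corrected_text(el):
    """Return the text under el, using the corr attribute of <sic> when present."""
    if el.tag == "sic" and el.get("corr"):
        return el.get("corr")
    parts = [el.text or ""]
    for child in el:
        parts.append(corrected_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def searchable_text(el):
    """Collapse whitespace so the corrected text can be fed to a word index."""
    return " ".join(corrected_text(el).split())


root = ET.fromstring(SAMPLE)
print(index_dates(root))
print(searchable_text(root.find("text/body")))
```

Because the dates are collected into their own index, a search can be limited to date values alone, and because the corrected spelling replaces the original in the searchable text, a query for the correctly spelled word still finds the passage.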

Although TEI, like EAD (Encoded Archival Description), was first developed in Standard Generalized Markup Language (SGML), the more recent editions (now working on version 1.0; the 0.5 version was released in July 2006) are marked up in XML (Extensible Markup Language), which is much more functional for computer processing.

TEI is developed as open source, and its development site is on SourceForge. Guidelines for use of the different versions of TEI can be found here.

Here is the list of links from the discussion above:

Recommended listserv: TEI.

Display software used is modified DLXS (Digital Library Extension Service).


This information is provided without guarantees as to validity or completeness, particularly in light of the fact that the world of metadata in digital libraries is a world of shifting sands, constantly changing.