Friday, 1 April 2005

Relaunch of NZETC collection using Topic Maps

The website of the New Zealand Electronic Text Centre has been
re-launched, with a new topic map infrastructure based on TM4J.

The website is a digital library, providing access to a couple of
hundred digitised books and manuscripts. The site has been running for
about 3 years, but this week we've upgraded it significantly, putting it
on a new foundation - a topic map.

The topic map presently contains 46807 topics, 192492 associations, and
43942 occurrences; roughly 150MB of XTM. We are using TM4J as our topic
map server, with TM4J's "in-memory" back-end, running on Java 1.4.1 on
Windows 2000. The topic map consumes approximately 1.3GB of RAM.

The source material for the site is a collection of TEI (Text Encoding
Initiative) XML files, each of which is an encoding of a source
object (e.g. a book). Most of the topic map is harvested from these
files using XSLT. Each book, chapter, subsection, figure, author,
publisher, etc, is represented by a topic; names are harvested from
headings and captions in the text; and the containment hierarchy is
represented by associations. These associations are used to generate
tables of contents, as well as to provide "next" and "previous" links
between web pages.
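
To give a feel for the harvesting step, here is a minimal sketch. It is not the XSLT we run: it is written in Python, it works on a made-up TEI-like fragment without namespaces, and the names `harvest`, `contains`, `next_link` and the sample ids are all hypothetical. Each structural element becomes a topic, its heading becomes the topic name, and containment becomes an association; reading order then gives the "next" and "previous" links.

```python
import xml.etree.ElementTree as ET

# A tiny TEI-like sample; real TEI files are namespaced and far richer.
SAMPLE = """
<text id="book1">
  <head>A Sample Book</head>
  <div id="ch1"><head>Chapter One</head>
    <div id="ch1s1"><head>First Section</head></div>
  </div>
  <div id="ch2"><head>Chapter Two</head></div>
</text>
"""


def harvest(element, topics, associations, parent_id=None):
    """Record one topic per structural element and one 'contains'
    association per parent/child pair, in document order."""
    topic_id = element.get("id")
    head = element.find("head")
    # The heading supplies the topic name; fall back to the id if absent.
    topics[topic_id] = head.text.strip() if head is not None else topic_id
    if parent_id is not None:
        associations.append(("contains", parent_id, topic_id))
    for child in element.findall("div"):
        harvest(child, topics, associations, topic_id)


topics = {}        # topic id -> name (dicts keep insertion order)
associations = []  # (association type, container id, contained id)
harvest(ET.fromstring(SAMPLE), topics, associations)

# "Next" and "previous" links follow the reading order of the topics.
order = list(topics)
next_link = dict(zip(order, order[1:]))
previous_link = dict(zip(order[1:], order))
```

A table of contents for any topic can be read off the same `associations` list by selecting the entries whose container id matches that topic.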

For each fragment of TEI text, we harvest two HTML occurrences which are
alternative representations of that piece of text. One is a "scholarly"
(fussy) view, in which page numbers, errors, deletions and corrections
(in manuscripts), etc, are all rendered, and the other is a "basic"
(simplified) view, in which spelling errors are silently corrected, page
numbers are not displayed, etc. These alternatives are distinguished
with "basic" and "scholarly" scoping topics. At present only the
scholarly view is visible on the public website, but we plan to make the
basic view visible next week. Cocoon XSLT pipelines are used to
transform the TEI into HTML (and some other formats).
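
The idea of picking a view by scope can be sketched as follows. This is only an illustration in Python, not TM4J's API: the `Occurrence` class, the `occurrence_for` helper and the file names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    scope: str  # name of the scoping topic: "basic" or "scholarly"
    href: str   # location of the HTML rendering (made-up paths)


def occurrence_for(occurrences, scope):
    """Return the first occurrence in the requested scope, or None."""
    return next((o for o in occurrences if o.scope == scope), None)


# One text fragment carries both alternative renderings.
fragment = [
    Occurrence("scholarly", "ch1-scholarly.html"),
    Occurrence("basic", "ch1-basic.html"),
]
chosen = occurrence_for(fragment, "basic")
```

Because both renderings hang off the same topic, switching the public site from one view to the other is a matter of asking for a different scope rather than re-harvesting anything.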

Names of people, places, etc, are also marked up in the TEI, and these
are also harvested as topics, with associations linking each person to
the places in the texts where they are mentioned, the figures in which
they are depicted, and the texts which they wrote. We use a MADS XML
file to maintain an authoritative list of names, from which we also
harvest some biographical notes and links to external websites.
Consequently, the system can generate a web page to represent each
person, with links to all the places in the library where they are
mentioned and all the texts they wrote, a thumbnail gallery of the
pictures in which they appear, and links to relevant external sites.
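
As a rough sketch of how a person's page falls out of the associations (Python again, with invented association type names and ids; the real types come from our ontology and are not shown here), the page is just the person's associations grouped by type:

```python
from collections import defaultdict

# (association type, person id, target id); all values are invented.
ASSOCIATIONS = [
    ("mentioned-in", "person1", "ch1s1"),
    ("depicted-in", "person1", "fig3"),
    ("author-of", "person1", "book2"),
    ("mentioned-in", "person2", "ch2"),
    ("mentioned-in", "person1", "ch2"),
]


def person_page(person_id, associations):
    """Group one person's associations by type, keeping source order."""
    page = defaultdict(list)
    for assoc_type, person, target in associations:
        if person == person_id:
            page[assoc_type].append(target)
    return dict(page)


page = person_page("person1", ASSOCIATIONS)
```

Each group then maps onto one part of the page: mentions become links, depictions become the thumbnail gallery, and authored texts become a list of works.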

The ontology used is a subset of the CIDOC CRM (a museum ontology).

The front end of the site uses Cocoon to render pages (each of which
represents a topic, and some "neighbouring" topics). We use Cocoon's
templating system "JXTemplate" to render each topic. JXTemplate is
designed to be very like XSLT, with an expression language called
"JXPath" which is more-or-less a superset of XPath, but which also
allows for traversal of Java objects via path expressions, e.g.
"$topic/occurrences[type=$ontology/html]". This avoids the conceptual
mismatch that can occur when using XSLT, which is tree-oriented, to
style XTM, which really represents a cyclic graph. We had to write a few
Java functions to add JXPath support for topic sorting, traversal of the
type hierarchy, and a few other features, but nothing too hard. We use
several different templates to render the different types of topics.
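
The helper functions themselves are small. The sketch below is not our Java code; it is a Python illustration of the two ideas, with hypothetical names (`supertypes`, `sort_by_name`) and made-up type names. The point it shows is that walking a graph that may contain cycles needs a record of the nodes already visited, which is exactly what a tree-oriented tool does not expect to need.

```python
def supertypes(topic, superclass_of):
    """Yield every type above `topic`, visiting each type only once."""
    seen = set()
    pending = list(superclass_of.get(topic, []))
    while pending:
        current = pending.pop()
        if current in seen:
            continue  # already visited: this is the guard against cycles
        seen.add(current)
        yield current
        pending.extend(superclass_of.get(current, []))


def sort_by_name(names):
    """Sort topic names case-insensitively for display."""
    return sorted(names, key=str.casefold)


# Made-up hierarchy; the last entry closes a cycle to exercise the guard.
SUPERCLASS_OF = {
    "chapter": ["text-section"],
    "text-section": ["document-part"],
    "document-part": ["text-section"],
}

above_chapter = list(supertypes("chapter", SUPERCLASS_OF))
sorted_names = sort_by_name(["wellington", "Auckland", "Christchurch"])
```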

In future we plan to harvest dates from the texts, and provide
timeline-based access to the texts. Our main technical concern is to
replace the in-memory topic map with a database, since we need to scale
up the topic map as our collection grows, and as we add more semantic
markup to the TEI.

Thanks very much to the members of TopicMapMail who have been an
invaluable resource during the (several month) gestation of the new